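Errata

(a) Equation (16) should read

$$\Omega_i(A) = \{\, y \in E^m(A) : P_i f_Y(y \mid H^i, A) > P_j f_Y(y \mid H^j, A),\ j \neq i \,\}$$

(b) Equation (54) should read

$$\sigma^{ij}_{1,2} = (R^i - R^j)^{-1} \Big\{ R^i \bar{y}^j - R^j \bar{y}^i \pm \Big[ R^i R^j \Big( (\bar{y}^i - \bar{y}^j)^2 + (R^i - R^j) \log\Big( \Big(\frac{P_j}{P_i}\Big)^2 \frac{R^i}{R^j} \Big) \Big) \Big]^{1/2} \Big\} \quad \text{if } R^i \neq R^j,$$

or

$$\sigma^{ij}_1 = \sigma^{ij}_2 = \tfrac{1}{2}(\bar{y}^i + \bar{y}^j) + \frac{R^i}{\bar{y}^j - \bar{y}^i} \log\frac{P_i}{P_j} \quad \text{if } R^i = R^j$$

(c) Equation (55) should read

$$P_i f_Y(y \mid H^i, A) = P_j f_Y(y \mid H^j, A), \quad i = 1, \dots, M,\ j \neq i$$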
OPTIMAL LINEAR AND NONLINEAR FEATURE EXTRACTION
BASED ON THE MINIMIZATION OF THE INCREASED
RISK OF MISCLASSIFICATION*


Rui J. P. de Figueiredo**
Department of Electrical Engineering
and
Department of Mathematical Sciences
Rice University, Houston, Texas 77001


We consider the problem of determining an optimal, not necessarily linear, transformation $A$ from a real $n$-dimensional measure space $E^n$, in which the raw data to be classified into $M$ ($M > 1$) pattern classes appear, to a "feature space" $E^m$ of a prescribed dimension $m < n$, in which classification is to be made. The Bayes risk in the transformed space $E^m$, called the "increased risk of misclassification", depends on $A$ and hence will be denoted by $Q_m(A)$. We assume that $A$ belongs to a given class $\mathcal{X}$ of transformations from $E^n$ to $E^m$, each member of $\mathcal{X}$ being a prescribed function of a vector parameter $a = (a_1, \dots, a_k)$ characterizing the member. For example, if $\mathcal{X}$ is the class of linear transformations, then members of $\mathcal{X}$ are constant $m \times n$ matrices, the components of the vector parameter $a$ characterizing a given matrix $A$ consisting of the $mn$ elements of that matrix. So, given an appropriate class $\mathcal{X}$, we select the optimal $A$ by minimizing $Q_m(A)$ over all $A \in \mathcal{X}$. Necessary and sufficient conditions for the existence of such an $A$ are given, and an iterative algorithm for the determination of $\hat{A}$ is presented. Finally, the results obtained are particularized for the case in which the statistics of the data are Gaussian.

1. Introduction

Suppose that a data vector $x = \mathrm{col}(x_1, \dots, x_n)$, belonging to the real $n$-dimensional Euclidean space $E^n$, is to be classified as pertaining to one of the $M$ pattern classes $H^1, \dots, H^M$. Then $x$ may be considered to be a realization of a random vector $X = \mathrm{col}(X_1, \dots, X_n)$. We will assume that $X_i$, $i = 1, \dots, n$, are continuous random variables possessing well defined probability density functions.

For $j = 1, \dots, M$, let $P_j$ denote the prior probability for the pattern class $H^j$, and $f_X(\cdot/H^j)$ the probability density function*** for $X$ conditioned on the class $H^j$ (called the likelihood function for the class $H^j$). Note that, endowed with these probabilities and likelihood functions, $E^n$ becomes a measure space.

We will assume that $P_j$, $j = 1, \dots, M$, are known and $f_X(\cdot/H^j)$, $j = 1, \dots, M$, can be learned from available training sets. The functions $f_X(\cdot/H^j)$, together with their first and second partial derivatives with respect to the components of $x$, will be assumed to be continuous and integrable on $E^n$.

Given an integer $m$, such that $1 \le m < n$, let $A$ be a function belonging to a given class $\mathcal{X}$ of functions from $E^n$ to $E^m(A)$. Here the $m$-dimensional Euclidean space $E^m$ is shown to be a function of $A$ because the measure on $E^m$ (introduced by the prior probabilities and the likelihood functions in $E^m$) is dependent on the transformation $A$.

*Supported in part by the NASA Contract NAS-9-12776, the U.S. Army Contract No. DA-31-124-ARO-D-662, and the NSF Grant GK-36375.

**Part of this work was performed while the author held a visiting research professorship at the Mathematics Research Center of the University of Wisconsin at Madison, in the academic year 1972-1973.

***We will denote the function by $f_X(\cdot/H^j)$ and its value at $x$ by $f_X(x/H^j)$.

In order to formulate the optimal feature extraction problem, we need to be given one more entity, namely a criterion functional, whose value corresponding to a given $A$ will be denoted by

$$Q(A;\ P_1, \dots, P_M;\ f_X(\cdot/H^1), \dots, f_X(\cdot/H^M)), \tag{1a}$$

which, when the other arguments are clear from the context, will be written simply as

$$Q(A). \tag{1b}$$

Then the optimal feature extraction problem may be stated precisely as follows:

Problem 1: Given $P_j$ and $f_X(\cdot/H^j)$, $j = 1, \dots, M$, a class $\mathcal{X}$, and a criterion functional $Q$, all defined as above, find $\hat{A}$ which minimizes* $Q(A)$ over all $A \in \mathcal{X}$.

In the existing literature (see for example [1] through [6] and the references therein), solutions to Problem 1 have been obtained assuming Gaussian statistics, using classes of linear transformations, and based on criterion functionals $Q$ that are probabilistic distances, such as the divergence, the Bhattacharyya distance, and the Matusita distance. In general, such distances lead to solutions that are at best suboptimal, that is, these solutions minimize a bound on the risk of misclassification rather than the risk of misclassification itself.

In what follows, we propose to solve Problem 1 by choosing the Bayes risk of misclassification, and in particular the probability of misclassification, as the criterion functional to be minimized. While the proposed solution may require more computational effort than the solutions based on probabilistic distances mentioned above, it is believed to be of great value for the following two reasons: (1) the feature extraction computation is a "design computation" which is performed off-line and only once, and hence the greater computational effort which may be required does not constitute a basic limitation; and (2) the proposed solution would give the maximum possible accuracy in classification achievable in a space of a prescribed dimension $m$.

2. Feature Extraction Based on the Minimization of the Increased Risk of Misclassification

If $A$ is a transformation which sends $x \in E^n$ to $y \in E^m$ we may write

$$y = \mathrm{col}(y_1, \dots, y_m) = A(x) = \mathrm{col}(A_1(x), \dots, A_m(x)). \tag{2}$$

For convenience, we will use the notation

$$x' = \mathrm{col}(x_1, \dots, x_m), \tag{3}$$


*If the criterion functional involves a probabilistic distance measure to be maximized (rather than minimized), such as the divergence or the Bhattacharyya distance, we define $Q$ to be the negative of such a distance, and minimize $Q$.



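$$x'' = \mathrm{col}(x_{m+1}, \dots, x_n), \tag{4}$$

and thus express

$$A(x) = A(x', x''), \tag{5}$$

$$A_i(x) = A_i(x', x''), \quad i = 1, \dots, m. \tag{6}$$

Let us introduce the Jacobian determinant

$$J_A(x) = J_A(x', x'') = \begin{vmatrix} \dfrac{\partial A_1(x', x'')}{\partial x_1} & \cdots & \dfrac{\partial A_1(x', x'')}{\partial x_m} \\ \vdots & & \vdots \\ \dfrac{\partial A_m(x', x'')}{\partial x_1} & \cdots & \dfrac{\partial A_m(x', x'')}{\partial x_m} \end{vmatrix}. \tag{7}$$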


From now on we will assume that the class $\mathcal{X}$ consists of (not necessarily linear) transformations $A$ from $E^n$ to $E^m$ such that:

(a) The (pure and mixed) second partial derivatives of $A(x)$ with respect to the components of $x$ are continuous;

(b) Except possibly on subsets of $E^n$ where all the likelihood functions vanish, the mapping under (2) of $x'$ to $y$ is one-to-one for every $x''$; and in particular, $J_A(x) \neq 0$ everywhere, except possibly on the above subsets of $E^n$.

Under conditions (a) and (b) above, we may, in the region of interest, express the variables $x_1, \dots, x_m$ in terms of $y_1, \dots, y_m$ and $x_{m+1}, \dots, x_n$ by inverting (2). Specifically, there is a unique transformation $B: E^n \to E^m$ such that

$$x' = B(y, x'') \tag{8}$$

for all $x'$, $x''$, and $y$ satisfying (2) in the region of interest. According to a well-known procedure [8], the likelihood functions in $E^m(A)$ would then be given by

$$f_Y(y/H^j, A) = \int_{-\infty}^{\infty} dx_n \cdots \int_{-\infty}^{\infty} dx_{m+1}\, \frac{f_X(B(y, x''), x''/H^j)}{|J_A(B(y, x''), x'')|}. \tag{9}$$

Remark 1: At the expense of complicating our presentation but otherwise adding no difficulty to our formulation, we could have enlarged the class $\mathcal{X}$ of transformations defined above by means of the two weakening conditions: (I) Allow the class $\mathcal{X}$ to include all transformations $A$ for which the vector $x'$ consists of any combination of $m$ variables from the set $\{x_1, \dots, x_n\}$ (rather than only the first $m$ variables from this set), provided conditions (a) and (b), with appropriate amendments in notation, are satisfied. (II) Weaken condition (b) so that for a given $x''$ and $y$ the equation $A(x', x'') = y$ is permitted to have a finite number of multiple roots, say $x'^{(1)} = B^{(1)}(y, x''), \dots, x'^{(k)} = B^{(k)}(y, x'')$. In a standard way, the integrand in (9) would be replaced by

$$\sum_{t=1}^{k} \frac{f_X(B^{(t)}(y, x''), x''/H^j)}{|J_A(B^{(t)}(y, x''), x'')|}. \tag{10}$$

(End of Remark)

For $i, j = 1, \dots, M$, let the nonnegative number $c_{ij}$ represent the cost of classifying a data vector as arising from $H^i$ when actually it originated from $H^j$. Again for simplicity in presentation and without loss of generality, we will assume that there is no cost involved in making a correct decision, i.e. that

$$c_{ii} = 0.$$


It is a well known and easily proved fact that, due to the reduction in dimensionality in going from $E^n$ to $E^m(A)$, the Bayes risk of misclassification in $E^m(A)$, denoted by $Q_m(A)$, is greater than or equal to that in $E^n$. For this reason, $Q_m(A)$ will be called the increased risk of misclassification and is expressed by

$$Q_m(A) = \sum_{i=1}^{M} \int_{\Omega_i(A)} \ell_i(y, A)\, dy, \tag{11}$$

where

$$\ell_i(y, A) = \sum_{j=1,\, j \neq i}^{M} c_{ij} P_j f_Y(y/H^j, A), \quad i = 1, \dots, M, \tag{12}$$

and $\Omega_i(A)$, $i = 1, \dots, M$, are decision regions in $E^m(A)$, that is, if $y \in \Omega_i(A)$ one says that it arose from $H^i$. Elementary decision theory also tells us that (for a given $A$) the choice of $\Omega_i(A)$, $i = 1, \dots, M$, which minimizes $Q_m(A)$ is given by

$$\Omega_i(A) = \{\, y \in E^m(A) : \ell_i(y, A) < \ell_j(y, A),\ j \neq i \,\}, \quad i = 1, \dots, M, \tag{13}$$

and, in the particular case in which the cost constants

$$c_{ij} = 1 - \delta_{ij}, \quad i, j = 1, \dots, M, \tag{14}$$

where $\delta_{ij}$ is the Kronecker delta, (11) becomes the probability of misclassification, (12) and (13) then reducing respectively to

$$\ell_i(y, A) = \sum_{j=1,\, j \neq i}^{M} P_j f_Y(y/H^j, A), \tag{15}$$

and

$$\Omega_i(A) = \{\, y \in E^m(A) : P_i f_Y(y/H^i, A) > P_j f_Y(y/H^j, A),\ j \neq i \,\}, \quad i = 1, \dots, M. \tag{16}$$
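The decision rule in (12) and (13) can be illustrated with a minimal sketch. The names `conditional_risks`, `decide`, and `gaussian_pdf`, as well as the two scalar Gaussian classes used as an example, are assumptions introduced here for illustration and do not come from the paper.

```python
import numpy as np

def conditional_risks(y, priors, densities, cost):
    """Return l_i(y) = sum_{j != i} c_ij P_j f_Y(y | H^j) for every class i, as in (12)."""
    weighted = np.array([p * f(y) for p, f in zip(priors, densities)])  # P_j f_Y(y | H^j)
    cost = np.asarray(cost, dtype=float)
    # c_ii = 0 by assumption, so the matrix product already skips the j == i term.
    return cost @ weighted

def decide(y, priors, densities, cost):
    """Index i of the decision region containing y: the minimizer of l_i(y), as in (13)."""
    return int(np.argmin(conditional_risks(y, priors, densities, cost)))

def gaussian_pdf(mean, var):
    """Scalar Gaussian density, used only to build the illustrative example below."""
    return lambda y: np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical two-class example with equal priors and zero-one costs (14).
priors = [0.5, 0.5]
densities = [gaussian_pdf(0.0, 1.0), gaussian_pdf(2.0, 1.0)]
zero_one_cost = 1.0 - np.eye(2)
```

With the zero-one cost matrix the rule reduces to (16): a point is assigned to the class with the largest product of prior and likelihood.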

We are thus able to reformulate Problem 1 as follows:

Problem 2: Given $P_j$ and $f_X(\cdot/H^j)$, $j = 1, \dots, M$, the class $\mathcal{X}$ of functions from $E^n$ to $E^m(A)$ satisfying conditions (a) and (b) above, and the criterion functional $Q_m$ defined by (11), find $\hat{A} \in \mathcal{X}$ which minimizes $Q_m(A)$ over all $A \in \mathcal{X}$.

In order to simplify our analysis, we will introduce two additional conditions (c) and (d), to be stated below.

(c) Every transformation $A$ belonging to $\mathcal{X}$ is expressible as $A(x) = \varphi(x, a)$, where $a = \mathrm{col}(a_1, \dots, a_k)$, belonging to a compact subset $\mathcal{A}$ of $E^k$, is a real parameter vector and $\varphi$ is a fixed function from $E^{n+k}$ to $E^m$; in other words, each member of $\mathcal{X}$ is obtained by assigning a different value to the parameter vector $a$ in the argument of the known function $\varphi(x, \cdot)$. We will assume that $\varphi$ has continuous second partial derivatives with respect to the components of $x$ and $a$; and that* $f_Y(y/H^j, a)$, $\partial f_Y(y/H^j, a)/\partial a_p$, and $\partial^2 f_Y(y/H^j, a)/\partial a_p \partial a_q$, $p, q = 1, \dots, k$, are continuous and integrable in the product spaces spanned respectively by the variables $y$; $y$ and $a_p$; and $y$, $a_p$ and $a_q$.

Remark 2: The above condition is not too restrictive. For example, the class of all linear transformations from $E^n$ to $E^m$ whose representation consists of $m \times n$ real constant matrices of bounded norm satisfies (c). In fact, the number $k$ of parameters in this case is simply the total number $m \cdot n$ of entries in any such matrix.

(End of Remark)

*In view of the condition just assumed we will from now on replace capital $A$ by small $a$ in the notation appearing in (9) through (16), e.g. $f_Y(y/H^j, a)$ instead of $f_Y(y/H^j, A)$, except when $A$ denotes a matrix.

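If $\psi$ is a function of $y$ and $a$, let its gradients with respect to these vectors be defined in the usual way:

$$\nabla_y \psi(y, a) = \mathrm{col}\Big( \frac{\partial}{\partial y_1} \psi(y, a), \dots, \frac{\partial}{\partial y_m} \psi(y, a) \Big), \tag{17a}$$

$$\nabla_a \psi(y, a) = \mathrm{col}\Big( \frac{\partial}{\partial a_1} \psi(y, a), \dots, \frac{\partial}{\partial a_k} \psi(y, a) \Big). \tag{17b}$$

Denote by $S_{ij}(a)$ the boundary between $\Omega_i(a)$ and $\Omega_j(a)$, that is

$$S_{ij}(a) = \{\, y \in E^m(a) : \ell_i(y, a) = \ell_j(y, a),\ \ell_i(y, a) \le \ell_p(y, a),\ p \neq i, j \,\}, \tag{18}$$

and call $S(a)$ the union of $S_{ij}(a)$, $i = 1, \dots, M$, $j \neq i$.

In order to avoid singular points in the description of $S(a)$, we require that

(d) For every nonzero $a \in \mathcal{A}$, and $i = 1, \dots, M$, $j \neq i$,

$$\nabla_y (\ell_i(y, a) - \ell_j(y, a)) \neq 0, \quad y \in S_{ij}. \tag{19}$$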

From now on, we will consider, instead of Problem 2, the following:

Problem 3: Same as Problem 2 with the additional restrictions (c) and (d).

3. Necessary and Sufficient Conditions for an Optimal Transformation

Consider the Hessian matrix*

$$H_a Q_m(a) = \nabla_a \nabla_a^T Q_m(a). \tag{20}$$

We first assert that under the above conditions, $\nabla_a Q_m(a)$ and $H_a Q_m(a)$ exist and can be evaluated by appropriately carrying out the differentiation operations under the integral sign.

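Lemma 1. Under the conditions stated,

$$\nabla_a Q_m(a) = \sum_{i=1}^{M} \int_{\Omega_i(a)} dy\, \nabla_a \ell_i(y, a). \tag{21}$$

Proof. Define the function $\ell: E^{k+m} \to E^1$ by

$$\ell(y, a) = \ell_i(y, a), \quad y \in \Omega_i(a) \text{ or } y \in S_{ij}(a), \quad i = 1, \dots, M,\ j \neq i. \tag{22}$$

From our conditions, $\ell$ is continuous on $E^{k+m}$ and, for any given $a$ and $\Omega_i(a)$, the first and second partials of $\ell$ with respect to the components of $a$ are continuous on $\Omega_i(a)$ and approach continuous limits as $y$ tends to the bounds of $\Omega_i(a)$. On the boundary, the partials have a simple discontinuity. Since $S(a)$ is of Lebesgue measure zero in $E^m(a)$ we may write

$$Q_m(a) = \sum_{i=1}^{M} \int_{\Omega_i(a)} dy\, \ell_i(y, a) = \int_{E^m} \ell(y, a)\, dy, \tag{23}$$

where we have purposely dropped the argument of $E^m(a)$ since it is immaterial in this calculation.

For any given integer $q$, $1 \le q \le k$, let

$$g(a) = \int_{E^m} dy\, \frac{\partial}{\partial a_q} \ell(y, a_1, \dots, a_k). \tag{24}$$

Since for every $a_1, \dots, a_{q-1}, a_{q+1}, \dots, a_k$, the integrand in (24) is integrable in the product space spanned by the variables $a_q$ and $y$, it follows, invoking Fubini's theorem in a standard way (with $a_{q0}$ an arbitrary real constant and $\alpha_q$ a variable of integration), that

$$\begin{aligned}
\int_{a_{q0}}^{a_q} d\alpha_q\, g(a_1, \dots, a_{q-1}, \alpha_q, a_{q+1}, \dots, a_k)
&= \int_{a_{q0}}^{a_q} d\alpha_q \int_{E^m} dy\, \frac{\partial}{\partial \alpha_q} \ell(y, a_1, \dots, a_{q-1}, \alpha_q, a_{q+1}, \dots, a_k) \\
&= \int_{E^m} dy \int_{a_{q0}}^{a_q} d\alpha_q\, \frac{\partial}{\partial \alpha_q} \ell(y, a_1, \dots, a_{q-1}, \alpha_q, a_{q+1}, \dots, a_k) \\
&= \int_{E^m} dy\, \big[ \ell(y, a_1, \dots, a_{q-1}, a_q, a_{q+1}, \dots, a_k) - \ell(y, a_1, \dots, a_{q-1}, a_{q0}, a_{q+1}, \dots, a_k) \big] \\
&= Q_m(a) - Q_m(a_1, \dots, a_{q-1}, a_{q0}, a_{q+1}, \dots, a_k).
\end{aligned} \tag{25}$$

But from the leftmost and rightmost members of the equalities (25), we conclude that $Q_m$ is the antiderivative (with respect to $a_q$) of $g$, or

$$g(a) = \frac{\partial Q_m(a)}{\partial a_q}. \tag{26}$$

(24) and (26) for $q = 1, \dots, k$ then establish the validity of (21). Q.E.D.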

Lemma 2. Under the conditions stated,

$$H_a Q_m(a) = \sum_{i=1}^{M} \int_{\Omega_i(a)} dy\, \nabla_a \nabla_a^T \ell_i(y, a) - \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \int_{S_{ij}(a)} ds\, \nabla_a (\ell_i(y, a) - \ell_j(y, a))\, \nabla_a^T (\ell_i(y, a) - \ell_j(y, a)), \tag{27}$$

where the second set of integrals consists of surface integrals on the subsets of $S(a)$ corresponding to the boundaries of not more than two regions. Thus $S_{ij}(a)$ is that subset of $S(a)$ which is the common boundary between $\Omega_i(a)$ and $\Omega_j(a)$ only.

Proof. As in the previous proof, we will carry out the integrations over the entire $E^m$. However, our presentation will be simplified using indicator functions for the regions $\Omega_i(a)$.

In fact, let the function $u: E^1 \to E^1$ be defined by

$$u(\xi) = \begin{cases} 1, & \xi > 0, \\ 0, & \xi \le 0. \end{cases} \tag{28}$$

Then, (21) may be rewritten as

$$\nabla_a Q_m(a) = \sum_{i=1}^{M} \int_{E^m} dy\, \Big( \prod_{j=1,\, j \neq i}^{M} u(\ell_j(y, a) - \ell_i(y, a)) \Big) \nabla_a \ell_i(y, a). \tag{29}$$

Now, provided we are willing to admit distribution functions, we may transfer further differentiation operations to within the integral sign; that is

$$H_a Q_m(a) = \sum_{i=1}^{M} \int_{E^m} dy\, \Big\{ \sum_{j=1,\, j \neq i}^{M} \big[ \delta(\ell_j(y, a) - \ell_i(y, a)) \big] \big[ \nabla_a (\ell_j(y, a) - \ell_i(y, a)) \big] \Big[ \prod_{r=1,\, r \neq i, j}^{M} u(\ell_r(y, a) - \ell_i(y, a)) \Big] \nabla_a^T \ell_i(y, a) + \Big( \prod_{j=1,\, j \neq i}^{M} u(\ell_j(y, a) - \ell_i(y, a)) \Big) \nabla_a \nabla_a^T \ell_i(y, a) \Big\}, \tag{30}$$

where $\delta(\cdot)$ denotes the delta function.

The last product term in (30) clearly leads to the first integral in (27).

Consider the remaining set of terms, all in square brackets, in (30). The first (delta function) term in square brackets reduces the integration of the set of terms under consideration to a surface integral on $S_{ij}(a)$. If $y \in S_{ij}(a)$ is on the boundary of more than two decision regions, say $\Omega_i(a)$, $\Omega_j(a)$, and $\Omega_r(a)$, the product term in the third set of square brackets vanishes because $u(\ell_r(y, a) - \ell_i(y, a)) = 0$ there. Thus points on $S_{ij}(a)$ common to boundaries of more than two decision regions make no contribution to the surface integrals under discussion. If $y$ is on the boundary common to $\Omega_i(a)$ and $\Omega_j(a)$ only, then the product term in the third set of square brackets is unity there. Thus integration on such points of $S_{ij}(a)$ leads to

$$\int_{S_{ij}(a)} ds\, \nabla_a (\ell_j(y, a) - \ell_i(y, a))\, \nabla_a^T \ell_i(y, a). \tag{31}$$

A similar calculation with the values of $i$ and $j$ interchanged leads to a surface integral of the form

$$\int_{S_{ij}(a)} ds\, \nabla_a (\ell_i(y, a) - \ell_j(y, a))\, \nabla_a^T \ell_j(y, a). \tag{32}$$

Thus the total contribution to the surface integral made by $S_{ij}(a)$ is

$$\int_{S_{ij}(a)} ds\, \nabla_a (\ell_j(y, a) - \ell_i(y, a))\, \nabla_a^T (\ell_i(y, a) - \ell_j(y, a)), \tag{33}$$

which establishes the validity of the second (double summation) term in (27). Q.E.D.

*Henceforth the superscript $T$ on a symbol will denote its transpose.

Remark 3: Before proceeding any further, it should be pointed out that since $\nabla_a \ell$ is continuous for almost all $a$ and $y$, it follows from the Lebesgue dominated convergence theorem that $\nabla_a Q_m(a)$ is continuous for all $a$.

In order to describe the set $\mathcal{A}$, let us now assume that we are given a function $\theta$ of the variable $a$, with values in $E^q$, where $q < k$, satisfying:

(e) $\theta$ has continuous second partials and is such that $\|\theta(a)\| \to \infty$ as $\|a\| \to \infty$.

Following convention, we say that $a \in E^k$ is a regular point of $\theta$ if $\nabla_a \theta(a)$ is of rank $q$.

We set

$$\mathcal{A} = \{\, a : \theta(a) = c = \text{given constant vector} \,\}. \tag{34}$$

Necessary and sufficient conditions for the existence of an optimal $a$ are formulated in terms of the following two theorems.

Theorem 1. Let conditions (a) through (e) hold and $\mathcal{A}$ be defined as in (34). Then: (i) Problem 3 always has a solution; and (ii) if $\hat{a}$ is such a solution and $\hat{a}$ is a regular point of $\theta$, then there is a $\hat{\lambda} \in E^q$ such that

$$\sum_{i=1}^{M} \int_{\Omega_i(\hat{a})} dy\, \nabla_a \ell_i(y, \hat{a}) + \sum_{t=1}^{q} \hat{\lambda}_t \nabla_a \theta_t(\hat{a}) = 0. \tag{35}$$

Proof. $\mathcal{A}$ described by (34) is clearly compact in $E^k$. Since $Q_m$ is continuous on $\mathcal{A}$, part (i) follows from Weierstrass's Theorem. Part (ii) follows from Lemma 1 and well-known optimization theory results. Q.E.D.

Theorem 2. Let the conditions as well as $\hat{a}$ and $\hat{\lambda}$ be as described in the preceding theorem. Then $\hat{a}$ is a local minimum of $Q_m$ on $\mathcal{A}$ if there is an $\varepsilon > 0$ such that

$$b^T V^T \Big[ H_a Q_m(\hat{a}) + \sum_{t=1}^{q} \hat{\lambda}_t H_a \theta_t(\hat{a}) \Big] V b \ge \varepsilon \|b\|^2 \tag{36}$$

for all $b \in E^{k-q}$, where $H_a Q_m(\hat{a})$ is as in (27) and $V$ is a $k \times (k-q)$ matrix whose columns span the null space of the matrix $(\nabla_a \theta(\hat{a}))^T$.

Proof. Immediate from Lemma 2 and well-known results from optimization theory. Q.E.D.

Remark 4: The following is also clear from well-known facts from optimization theory. Let $\xi$ denote the subset of $\{1, \dots, q\}$ such that $\theta_i$ is nonlinear if and only if $i \in \xi$ ($\xi$ may be empty). By enlarging $\mathcal{A}$ if necessary, extend it to the set $\mathcal{A}_c$ by replacing in (34) the equality sign by $\le$ for those $\theta_i$ with $i \in \xi$. If $\theta_i$, $i \in \xi$, are convex (and hence $\mathcal{A}_c$ is convex) and $Q_m$ is convex on $\mathcal{A}_c$, then $\hat{a}$ satisfying Theorem 2 is a point of global minimum of $Q_m$ on $\mathcal{A}_c$.

Next we focus our attention on the following important special case.

4. Linear Feature Extraction of Gaussian Features

Let us in fact particularize the results just described to the case in which the statistics of the pattern classes are Gaussian and the class $\mathcal{X}$ of transformations from $E^n$ to $E^m$ is linear. Specifically, we let $\mathcal{X}$ be a compact subset of the real finite-dimensional inner product space $\mathcal{M}$ of real $m \times n$ matrices with the inner product between any two elements $A$ and $B$ of $\mathcal{M}$ defined by

$$\langle A, B \rangle = \mathrm{tr}(A B^T) = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ij}, \tag{37}$$

where the abbreviation tr stands for trace.

It is a simple matter to show that (37) is a valid inner product in $\mathcal{M}$.

If $g$ is a real-valued function of a matrix-valued variable $A$ belonging to $\mathcal{M}$, such that at some value of $A$, say $\bar{A}$, $g$ is differentiable with respect to the elements of $A$, one can show from the abstract definition of the gradient that the gradient of $g$ with respect to $A$, evaluated at $\bar{A}$, is simply the member $\nabla_A g(\bar{A})$ of $\mathcal{M}$ whose $ij$-th element is

$$\frac{\partial g(\bar{A})}{\partial A_{ij}}. \tag{38}$$

Formulas for matrix gradients of various types of real-valued functions of matrices have been derived in reference [13]. Recently, Decell and Quirein have used such formulas to express gradients of the divergence and Bhattacharyya distance in an easily computable form.
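As an illustration of the element-wise definition (38), the following sketch compares a closed-form matrix gradient with a finite-difference estimate. The test function $g(A) = \log|A \Pi A^T|$, its gradient $2 (A \Pi A^T)^{-1} A \Pi$ for symmetric $\Pi$, and all function names are choices made here for illustration; the matrices are arbitrary random test data.

```python
import numpy as np

def log_det_R(A, Pi):
    """g(A) = log |A Pi A^T| for an m x n matrix A and symmetric positive definite Pi."""
    _, logdet = np.linalg.slogdet(A @ Pi @ A.T)
    return logdet

def grad_log_det_R(A, Pi):
    """Closed-form matrix gradient 2 (A Pi A^T)^{-1} A Pi (relies on the symmetry of Pi)."""
    R = A @ Pi @ A.T
    return 2.0 * np.linalg.solve(R, A @ Pi)

def numerical_gradient(g, A, step=1e-6):
    """Matrix whose ij-th entry is a central-difference estimate of dg/dA_ij, as in (38)."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = step
            G[i, j] = (g(A + E) - g(A - E)) / (2.0 * step)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))
S = rng.standard_normal((4, 4))
Pi = S @ S.T + 4.0 * np.eye(4)          # arbitrary symmetric positive definite matrix
analytic = grad_log_det_R(A, Pi)
numeric = numerical_gradient(lambda M: log_det_R(M, Pi), A)
```

The two arrays should agree to within the finite-difference error, and the inner product (37) of either one with a direction $B$ gives the directional derivative of $g$ along $B$.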

In what follows, we apply the results of Theorem 1 to the Gaussian case by first computing $\nabla_A f_Y(y/H^j, A)$ at a given $A \in \mathcal{X}$. However, in applying the results of Theorem 2, since the Hessian $H_A f_Y(y/H^j, A)$ is a linear operator that cannot be represented in matrix form without destroying the matrix structure of the element of $\mathcal{M}$ on which it is acting, we obtain instead the matrix $H_A f_Y(y/H^j, A) B$, for arbitrary $B$, and hence, by integration, $H_A Q_m(A) B$. Note that $H_A Q_m(A) B$ is all that is needed in connection with (36), where the vector $Vb$ now corresponds to the matrix $B \in \mathcal{M}$.

$H_A f_Y(y/H^j, A) B$ is the Gateaux differential $\delta \nabla_A f_Y(y/H^j, A; B)$ of $\nabla_A f_Y(y/H^j, \cdot)$ at $A$ along $B$ and is easily computable from the formula

$$\delta \nabla_A f_Y(y/H^j, A; B) = \lim_{t \to 0} \frac{1}{t} \big[ \nabla_A f_Y(y/H^j, A + tB) - \nabla_A f_Y(y/H^j, A) \big] \tag{39}$$

$$= \frac{d}{dt} \nabla_A f_Y(y/H^j, A + tB) \Big|_{t=0}, \quad t \in E^1. \tag{40}$$

Returning now to our original problem, under the normality hypothesis the probability density functions conditioned on pattern classes, in the transformed space $E^m(A)$, are

$$f_Y(y/H^j, A) = (2\pi)^{-m/2} |R^j|^{-1/2} \exp\big( -\tfrac{1}{2} (y - \bar{y}^j)^T (R^j)^{-1} (y - \bar{y}^j) \big), \quad j = 1, \dots, M, \tag{41}$$

where $A$ is an $m \times n$ matrix (belonging to $\mathcal{X}$), $|R^j|$ denotes the determinant of $R^j$, and $\bar{y}^j$ and $R^j$ are the mean and covariance pertaining to $H^j$ in $E^m(A)$. The latter two entities are related to the given mean $\bar{x}^j$ and covariance* $\Pi^j$ associated with $H^j$ in the original space $E^n$ by

$$\bar{y}^j = A \bar{x}^j, \tag{42}$$

$$R^j = A \Pi^j A^T. \tag{43}$$

*We assume that $R^j$ is nonsingular and that [illegible], $j, p = 1, \dots, M$, $p \neq j$, the particularization to the special cases when these conditions do not hold being [remainder illegible].

We will require:

Lemma 3. For $f_Y(y/H^j, A)$, $j = 1, \dots, M$, as in (41),

$$\nabla_A f_Y(y/H^j, A) = f_Y(y/H^j, A) \big[ \big( (R^j)^{-1} (y - \bar{y}^j)(y - \bar{y}^j)^T - I \big) (R^j)^{-1} A \Pi^j + (R^j)^{-1} (y - \bar{y}^j)(\bar{x}^j)^T \big]. \tag{44}$$

Proof. From (41) we obtain

$$\nabla_A f_Y(y/H^j, A) = (2\pi)^{-m/2} \exp\big( -\tfrac{1}{2} (y - \bar{y}^j)^T (R^j)^{-1} (y - \bar{y}^j) \big) \Big( \big[ \nabla_A |A \Pi^j A^T|^{-1/2} \big] - \tfrac{1}{2} |R^j|^{-1/2} \big[ \nabla_A \big( (y - A \bar{x}^j)^T (A \Pi^j A^T)^{-1} (y - A \bar{x}^j) \big) \big] \Big). \tag{45}$$

The term in the first set of square brackets gives

$$\nabla_A |A \Pi^j A^T|^{-1/2} = -\tfrac{1}{2} |R^j|^{-3/2} \nabla_A |R^j| = -\tfrac{1}{2} |R^j|^{-3/2} |R^j|\, \mathrm{tr}\big[ (R^j)^{-1} \nabla_A (A \Pi^j A^T) \big] = -|R^j|^{-1/2} (R^j)^{-1} A \Pi^j, \tag{46}$$

where here, as in what follows, we use the fact that $\Pi^j$ is symmetric; while the term in the second set of square brackets leads to

$$\begin{aligned}
\nabla_A \big( (y - A \bar{x}^j)^T (A \Pi^j A^T)^{-1} (y - A \bar{x}^j) \big)
&= \big[ \nabla_A (y - A \bar{x}^j)^T \big] (R^j)^{-1} (y - \bar{y}^j) \\
&\quad - (y - \bar{y}^j)^T (R^j)^{-1} \big[ \nabla_A (A \Pi^j A^T) \big] (R^j)^{-1} (y - \bar{y}^j) \\
&\quad + (y - \bar{y}^j)^T (R^j)^{-1} \big[ \nabla_A (y - A \bar{x}^j) \big] \\
&= -2 (R^j)^{-1} (y - \bar{y}^j)(\bar{x}^j)^T - 2 (R^j)^{-1} (y - \bar{y}^j)(y - \bar{y}^j)^T (R^j)^{-1} A \Pi^j.
\end{aligned} \tag{47}$$

Substituting (46) and (47) in (45), (44) is established. Q.E.D.

There are a number of ways one may impose constraints to guarantee compactness of $\mathcal{X}$. One is to require that

$$\tfrac{1}{2} \|A\|^2 = \tfrac{1}{2} \mathrm{tr}(A A^T) = \gamma, \quad A \in \mathcal{M}, \quad \gamma = \text{const}. \tag{48a}$$

Another is to allow only those $A \in \mathcal{M}$ consisting of orthonormal row vectors by requiring that

$$A A^T = I, \quad A \in \mathcal{X}. \tag{48b}$$

For simplicity in presentation, we will assume

$$\mathcal{X} = \{\, A \in \mathcal{M} : A \text{ satisfies (48a)} \,\}. \tag{48c}$$

An additional important consideration concerning restrictions on the set $\mathcal{X}$ is spelt out in Remark 7, at the end of this section.

By virtue of the above Lemma, Theorem 1 clearly reduces to:

Theorem 3. Suppose that in Problem 3 the pattern classes $H^j$, $j = 1, \dots, M$, are Gaussian with means and covariances $\bar{x}^j$ and $\Pi^j$, $j = 1, \dots, M$, and $\mathcal{X}$ is as in (48c). Then the problem always has a solution. At any such solution $\hat{A}$ it is necessary that the following matrix equation be satisfied:

$$\sum_{i=1}^{M} \sum_{j=1,\, j \neq i}^{M} c_{ij} P_j \big[ \big( (\hat{R}^j)^{-1} \hat{D}^{ij} - I \big) (\hat{R}^j)^{-1} \hat{A} \Pi^j + (\hat{R}^j)^{-1} \hat{d}^{ij} (\bar{x}^j)^T \big] + \hat{\lambda} \hat{A} = 0, \tag{49}$$

where $I$ is the identity matrix, $\hat{\lambda}$ is a Lagrange multiplier (to be calculated using (48a)),

$$\hat{D}^{ij} = \int_{\Omega_i(\hat{A})} (y - \hat{\bar{y}}^j)(y - \hat{\bar{y}}^j)^T f_Y(y/H^j, \hat{A})\, dy, \tag{50}$$

$$\hat{d}^{ij} = \int_{\Omega_i(\hat{A})} (y - \hat{\bar{y}}^j) f_Y(y/H^j, \hat{A})\, dy, \tag{51}$$

and the circumflex on a symbol denotes that the corresponding quantity is to be calculated using $\hat{A}$.

Remark 5. The iterative algorithm outlined in the following section requires the left side of (49) to be computed at every iteration using the estimate of $A$ obtained in the preceding iteration. At first sight, the need for the evaluation of the multiple integrals in (50) and (51) at each iteration might appear as a serious drawback of this procedure. However, this difficulty can be avoided if we notice that $D^{ij}$ and $d^{ij}$ are in fact proportional to the covariance and mean of the random variable $(Y^j - \bar{y}^j)$ (where $Y^j$ has the density $f_Y(\cdot/H^j, A)$) when restricted to the decision region $\Omega_i(A)$. Thus if training sets for the various pattern classes are available, the integrals in question may be replaced by sample averages over appropriate subsets of training sets. Specifically, suppose that, for $j = 1, \dots, M$, pertaining to $H^j$ we have the training set

$$\{ y^{j1}, \dots, y^{jN_j} \} = \{ A x^{j1}, \dots, A x^{jN_j} \},$$

where $x^{j1}, \dots, x^{jN_j}$ are the given training samples in $E^n$. Then we have for estimates of $D^{ij}$ and $d^{ij}$

$$D^{ij} = \frac{1}{N_j} \sum_{y^{jq} \in \Omega_i(A)} (y^{jq} - \bar{y}^j)(y^{jq} - \bar{y}^j)^T, \tag{52}$$

$$d^{ij} = \frac{1}{N_j} \sum_{y^{jq} \in \Omega_i(A)} (y^{jq} - \bar{y}^j). \tag{53}$$
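The sample averages (52) and (53) can be sketched as follows. The function name `region_moments` and the callable `in_region_i`, which stands in for whatever test decides membership in $\Omega_i(A)$, are hypothetical names introduced for this illustration; the three training vectors are made-up toy data.

```python
import numpy as np

def region_moments(samples_j, A, mean_j, in_region_i):
    """Estimate D^{ij} and d^{ij} from class-j training samples, following (52)-(53).

    samples_j   : (N_j, n) array of training vectors x^{jq} from class H^j
    A           : (m, n) feature extraction matrix
    mean_j      : (n,) class mean in the original space
    in_region_i : hypothetical callable, True when a feature vector y lies in Omega_i(A)
    """
    Y = samples_j @ A.T                        # y^{jq} = A x^{jq}
    centered = Y - A @ mean_j                  # y^{jq} - ybar^j
    mask = np.array([bool(in_region_i(y)) for y in Y], dtype=float)
    N_j = samples_j.shape[0]
    weighted = centered * mask[:, None]        # zero out samples outside Omega_i(A)
    D = weighted.T @ centered / N_j
    d = weighted.sum(axis=0) / N_j
    return D, d

# Toy example: three samples in E^2 mapped to a scalar feature, region y > 0.
samples = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0]])
A = np.array([[1.0, 1.0]])
D, d = region_moments(samples, A, np.zeros(2), lambda y: y[0] > 0.0)
```

Note that the sums run only over samples falling in the region while the divisor remains the full class sample size $N_j$, as in (52) and (53).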


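Remark 6. In the particular case in which $m = 1$, $Y = \sum_{j=1}^{n} A_{1j} X_j$ being a scalar random variable, (50) and (51) become single (rather than multiple) integrals and can therefore be easily computed without one having to go through the route described in the preceding remark. Let us assume that the risk is the probability of misclassification and hence (14) holds; then the decision boundaries are real numbers which are chosen, according to (16), from among the roots

$$\sigma^{ij}_{1,2} = (R^i - R^j)^{-1} \Big\{ R^i \bar{y}^j - R^j \bar{y}^i \pm \Big[ R^i R^j \Big( (\bar{y}^i - \bar{y}^j)^2 + (R^i - R^j) \log\Big( \Big(\frac{P_j}{P_i}\Big)^2 \frac{R^i}{R^j} \Big) \Big) \Big]^{1/2} \Big\} \quad \text{if } R^i \neq R^j,$$

$$\sigma^{ij}_1 = \sigma^{ij}_2 = \tfrac{1}{2}(\bar{y}^i + \bar{y}^j) + \frac{R^i}{\bar{y}^j - \bar{y}^i} \log\frac{P_i}{P_j} \quad \text{if } R^i = R^j, \qquad i = 1, \dots, M,\ j \neq i, \tag{54}$$

of the equations

$$P_i f_Y(y/H^i, A) = P_j f_Y(y/H^j, A), \quad i = 1, \dots, M,\ j \neq i, \tag{55}$$

these densities being described by (41). Suppose that the $i$-th decision region consists of an interval $\sigma_1^i < y < \sigma_2^i$. Then a trivial calculation resulting from the substitution of (41) in (50) and (51) leads to

$$D^{ij} = R^j \Big\{ (2\pi)^{-1/2} \big[ B_1^{ij} \exp\big( -\tfrac{1}{2} (B_1^{ij})^2 \big) - B_2^{ij} \exp\big( -\tfrac{1}{2} (B_2^{ij})^2 \big) \big] + \mathrm{Erf}(0, 1; B_2^{ij}) - \mathrm{Erf}(0, 1; B_1^{ij}) \Big\}, \tag{56}$$

$$d^{ij} = \Big( \frac{R^j}{2\pi} \Big)^{1/2} \big[ \exp\big( -\tfrac{1}{2} (B_1^{ij})^2 \big) - \exp\big( -\tfrac{1}{2} (B_2^{ij})^2 \big) \big], \tag{57}$$

where

$$\mathrm{Erf}(0, 1; \alpha) = (2\pi)^{-1/2} \int_{-\infty}^{\alpha} \exp\big( -\tfrac{1}{2} \xi^2 \big)\, d\xi, \tag{58}$$

$$B_1^{ij} = (R^j)^{-1/2} (\sigma_1^i - \bar{y}^j), \quad B_2^{ij} = (R^j)^{-1/2} (\sigma_2^i - \bar{y}^j). \tag{59}$$

(End of Remark)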
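The scalar-case formulas of this remark can be checked numerically with a short sketch. It assumes that $\mathrm{Erf}(0, 1; \alpha)$ denotes the standard normal distribution function, which is how it is used here; the function names are illustrative, and the root formula is the one obtained by equating the weighted scalar Gaussian densities in (55).

```python
import math

def normal_cdf(alpha):
    """Standard normal distribution function, used here in the role of Erf(0, 1; alpha)."""
    return 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))

def _phi(b):
    """Standard normal density."""
    return math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)

def _b_phi(b):
    """b * phi(b), taking the limit 0 at infinite interval endpoints."""
    return 0.0 if math.isinf(b) else b * _phi(b)

def interval_moments(lo, hi, mean_j, var_j):
    """Closed forms (56)-(57) for D^{ij} and d^{ij} when Omega_i is the interval (lo, hi)."""
    b1 = (lo - mean_j) / math.sqrt(var_j)
    b2 = (hi - mean_j) / math.sqrt(var_j)
    D = var_j * (_b_phi(b1) - _b_phi(b2) + normal_cdf(b2) - normal_cdf(b1))
    d = math.sqrt(var_j) * (_phi(b1) - _phi(b2))
    return D, d

def boundary_roots(p_i, mean_i, var_i, p_j, mean_j, var_j):
    """Real roots of P_i f(y|H^i) = P_j f(y|H^j) for scalar Gaussian densities."""
    if math.isclose(var_i, var_j):
        if math.isclose(mean_i, mean_j):
            return []                      # identical densities: no isolated boundary
        return [0.5 * (mean_i + mean_j)
                + var_i * math.log(p_i / p_j) / (mean_j - mean_i)]
    disc = var_i * var_j * ((mean_i - mean_j) ** 2
                            + (var_i - var_j)
                            * math.log(p_j ** 2 * var_i / (p_i ** 2 * var_j)))
    if disc < 0.0:
        return []                          # the weighted densities never cross
    base = var_i * mean_j - var_j * mean_i
    return sorted((base + s * math.sqrt(disc)) / (var_i - var_j) for s in (1.0, -1.0))
```

Taking the whole real line as the interval recovers the full variance for $D^{ij}$ and zero for $d^{ij}$, which is a convenient sanity check before using the formulas inside an iteration.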


To verify the conditions of Theorem 2 under the Gaussian hypothesis, we first calculate (omitting the derivation) the term appearing under the first summation in (27):

t t (A,B) - 


Cl t (A) 

l 

jdl 

•2(R J )* l (BJ*A T sAff J B T )(R J )' l n 1J (R J )* l Alf J 

4(rV V J (rV 1 BR J 

-(R J )‘ , (BK J (d 1J ) T >-(d 1J ); J B 7 )(R J )‘ l Al J 

■(R J )* l (BS rJ A T -fA(lr J ) T B T )(R J )' l d li (; J ) T 

-(r j )* 1 b(r j s.; j (; j ) t ) 

♦ ( R J ) ‘ 1 ( BlT J A T +A]r J B T ) (R J ) ' 1 Alf J 

-tr|(R J )' l Aff J B T ||((R J )' l D 1 J -I)(R J )' l Ar J S-(B J )* l d li (; J ) T ] 

tc ! *(B)(R J )' l A? ! 

-V 1J (B)(R J )' l At J 


+h‘ J (B)(i J ) T 


♦F lJ (B)(R J )* l A?t J 


-(d tJ ) 1 (R J )' 1 B; J (R J )‘ , Al 2 

♦v 1J (B)(J J ) T ) , (60) 

where 

<P 1J (B)-. dy f y (- /H J ,A)(y-y J ) T (R J )* l Aff J B T (y-y J ) 

n t (A) (6i) 

h tJ (BK dy My/H J ,A)(y.y J ) T (*V l • 

n t (A) 

Aff J B T (y-y J )(R J ) l (y.y^) (62) 

v lJ (B)-.’ dy f v (y/H A)(yy J ) T (R J )‘ l g;-l(gJ)* l (y.ji) 
fiy(A) ’ 

(63) 

F 1J (B)-; cy f ¥ (y/H J ,A)(y-J J ) T (RJ)‘ l »;J • 
f) t (A) Y 

(R J ) l (y-y J )(y-y J ) T (64) 

C* ’(»)-„" dy f Y (v/H J ,A)(y-; J ) T (R J )' 1 A* J B*(y-y J ) . 

O t (A) 

(•^"’(y-y^Hyy^) 1 . (65) 

Since all the above integrals are expectations, they may be computed from the training samples in the same way as (52) and (53).

The surface integral terms in (27) are calculated in a similar fashion. In the special case in which the risk is the probability of misclassification, these terms (taking into account the minus sign that precedes the double summation) reduce to
$$-\sum_{i=1}^{M-1} \sum_{j=i+1}^{M} N^{ij}(A), \tag{66}$$

where

$$N^{ij}(A) = \int_{S_{ij}(A)} ds\, \big[ P_i \nabla_A f_Y(y/H^i, A) - P_j \nabla_A f_Y(y/H^j, A) \big] \big[ P_i \nabla_A f_Y(y/H^i, A) - P_j \nabla_A f_Y(y/H^j, A) \big]^T, \tag{67}$$

where for $\nabla_A f_Y(y/H^i, A)$, $i = 1, \dots, M$, we have to use the expression given by the right side of (44). Then quantities similar to those in (61) to (65) result, which again can be computed from the training samples.
Theorem 4. If $\hat{A}$ satisfies the conditions of Theorem 3, then satisfaction of (36) is equivalent to the requirement that

M- X M 

tr (l E ( L.(A.b) ♦ E N 1 ^ (A ) B ) ♦ ^.(A.b) 

1-1 1 J-la-1 


♦ ill U T ) £ *tr( B| T ) (48) 


for every $B \in \mathcal{M}$ such that the transpose of the $p$-th row of $B$ lies in the null space of the $p$-th column of $\hat{A}$, $p = 1, \dots, m$.

Remark 7: Let $\mathcal{J}$ denote a set containing $m$ distinct elements from $\{1, \dots, n\}$ and denote by $A_{\cdot j}$ the $j$-th column of $A$. For some given $\mathcal{J}$, we may wish to restrict the class $\mathcal{X}$ so that for every $A \in \mathcal{M}$ belonging to $\mathcal{X}$, the columns $A_{\cdot j}$, $j \in \mathcal{J}$, constitute a submatrix of rank $m$. Since $Q_m$ is invariant with respect to non-singular coordinate transformations in the transformed (feature) space, we may, in the case under consideration, set $\{A_{\cdot j} : j \in \mathcal{J}\}$ equal to a permutation $P$ of the columns of the unit $m \times m$ diagonal matrix, leaving the $m(n-m)$ elements in the remaining columns free for the optimization procedure. We would optimize these elements by requiring that the corresponding elements in the matrix equation (49) satisfy (49) (with the remaining elements of $A$ set equal to the elements of $P$).

The above remark is particularly useful in the verification of the sufficient conditions. For suppose that to begin with we let all elements of $A$ vary (since we did not know at the outset which columns of $A$ had rank $m$) and thereby determined $\hat{A}$ by means of an iterative procedure based on (49). Assume that we then find that the first $m$ columns of $\hat{A}$ constitute a non-singular matrix $\hat{A}_1$. Thus let $\hat{A} = (\hat{A}_1, \hat{A}_2)$. We may then replace $\hat{A}$ by $(I, \hat{A}_1^{-1} \hat{A}_2)$ and restrict the verification of the sufficiency conditions to a smaller subset of matrices $B$ than otherwise required (since the only relevant columns of $A$ are the last $n - m$ columns).

5. Computational Algorithm


Various iterative methods such as the Newton and gradient methods and their numerous modifications are available for the computation of the optimal feature extraction transformation. We will limit our discussion to the case of linear feature extraction of Gaussian features, our remarks extending trivially to the general non-Gaussian nonlinear case treated in Section 3.

The values $A_p$ and $\lambda_p$ of the estimates of the optimal matrix $\hat{A}$ and Lagrange multiplier $\hat{\lambda}$ at the $p$-th iteration are given, according to (49) and (48c), by


$$A_p = A_{p-1} - K_p \Big\{ \sum_{i=1}^{M} \sum_{j=1,\, j \neq i}^{M} c_{ij} P_j \big[ \big( (R^j_{p-1})^{-1} D^{ij}_{p-1} - I \big) (R^j_{p-1})^{-1} A_{p-1} \Pi^j + (R^j_{p-1})^{-1} d^{ij}_{p-1} (\bar{x}^j)^T \big] + \lambda_{p-1} A_{p-1} \Big\}, \tag{69a}$$

$$\lambda_p = \lambda_{p-1} + \mu_p \big( \tfrac{1}{2} \mathrm{tr}(A_{p-1} A_{p-1}^T) - \gamma \big), \quad p = 1, 2, \dots, \tag{69b}$$

where all the symbols subscripted with $p-1$ are to be computed with the values $A_{p-1}$ and $\lambda_{p-1}$ obtained at the $(p-1)$-th iteration, $K_p$ and $\mu_p$ are variable matrix and scalar gains determined according to the iterative method selected, the initial estimates $A_0$ and $\lambda_0$ are set at convenient values, and $D^{ij}_{p-1}$ and $d^{ij}_{p-1}$ are obtained by (56) and (57) if the dimension $m$ of the feature space is one, and by (52) and (53) otherwise.
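A minimal sketch of an iteration of the general form (69a, b) is given below, under several assumptions that are not from the paper: the gains are constant scalars `kappa` and `mu` rather than variable matrix and scalar gains, the multiplier is updated by adding the constraint violation of (48a), and `grad_Q` is a hypothetical callable returning the gradient of the criterion (the left side of (49) without the multiplier term). The toy quadratic criterion is used only so that the fixed point is known in closed form.

```python
import numpy as np

def constrained_gradient_iteration(grad_Q, A0, gamma, kappa=0.1, mu=0.1, iterations=2000):
    """Plain gradient iteration on A with a multiplier update for (1/2) tr(A A^T) = gamma."""
    A = np.array(A0, dtype=float)
    lam = 0.0
    for _ in range(iterations):
        residual = grad_Q(A) + lam * A                 # stationarity residual, cf. (49)
        violation = 0.5 * np.trace(A @ A.T) - gamma    # constraint (48a)
        A = A - kappa * residual                       # update of the matrix estimate
        lam = lam + mu * violation                     # update of the multiplier estimate
    return A, lam

# Toy criterion Q(A) = (1/2) ||A - T||^2 with T = 2 I, so grad Q(A) = A - T.
# With gamma = 1 the constrained minimizer is A = I with multiplier 1.
target = 2.0 * np.eye(2)
A_opt, lam_opt = constrained_gradient_iteration(lambda A: A - target, target, gamma=1.0)
```

In the setting of the paper, `grad_Q` would be assembled from the estimates of $D^{ij}$ and $d^{ij}$ at the current matrix, and a quasi-Newton method would normally replace the constant gains used in this sketch.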

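Note that a convenient way of writing (52) and (53) is

$$D^{ij} = \frac{1}{N_j} \sum_{q=1}^{N_j} (y^{jq} - \bar{y}^j)(y^{jq} - \bar{y}^j)^T \Lambda_i(y^{jq}), \tag{70}$$

$$d^{ij} = \frac{1}{N_j} \sum_{q=1}^{N_j} (y^{jq} - \bar{y}^j) \Lambda_i(y^{jq}), \tag{71}$$

where

$$\Lambda_i(y) = \prod_{j=1,\, j \neq i}^{M} u(\ell_j(y, A) - \ell_i(y, A)). \tag{72}$$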


We have used the Davidon-Fletcher-Powell (D-F-P) method in implementing (69a, b) on the IBM 370/155 computer of the Rice University Institute for Computer Services and Applications (ICSA). The D-F-P procedure requires function evaluations in addition to gradient evaluations. Evaluation of $Q_m(A)$ at the $p$-th iteration may be carried out by the following estimate, justified in the same way as (52) and (53):

$$Q_m(A_p) = \sum_{i=1}^{M} \sum_{j=1,\, j \neq i}^{M} c_{ij} P_j \frac{1}{N_j} \sum_{q=1}^{N_j} \Lambda_i(y^{jq}). \tag{73}$$
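The estimate (73) amounts to a cost-weighted count of misassigned training samples, which can be sketched as follows. The function name `empirical_risk`, the callable `decide` (returning the index of the decision region for a feature vector), and the toy scalar data are assumptions made for this illustration.

```python
import numpy as np

def empirical_risk(features_by_class, priors, cost, decide):
    """Estimate Q_m as in (73).

    features_by_class[j] holds the class-j feature vectors y^{jq}; decide(y) returns
    the index i of the region Omega_i containing y, so that the fraction of class-j
    samples with decide(y) == i plays the role of (1/N_j) sum_q Lambda_i(y^{jq}).
    """
    M = len(priors)
    risk = 0.0
    for j in range(M):
        decisions = np.array([decide(y) for y in features_by_class[j]])
        for i in range(M):
            if i != j:
                risk += cost[i][j] * priors[j] * np.mean(decisions == i)
    return float(risk)

# Toy scalar example with a single threshold at y = 1 and zero-one costs.
features = [np.array([0.1, 0.2, 1.5, -0.3]), np.array([2.1, 0.9, 1.8, 2.5])]
estimate = empirical_risk(features, [0.5, 0.5], 1.0 - np.eye(2), lambda y: int(y > 1.0))
```

In this example one sample of each class falls on the wrong side of the threshold, so the estimate is the prior-weighted sum of the two error fractions.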


If one knows beforehand that certain features are significant, they may be retained, thus reducing the number of parameters of the matrix $A$ to be determined according to Remark 7.


6. Conclusion 

General classes of nonlinear and linear transformations $A$ for the reduction of the dimensionality of the classification (feature) space so that, for a prescribed dimension $m$ of this space, the increase of the misclassification risk is minimized, have been investigated. Necessary conditions that must be satisfied by the optimal $A$ have been presented and sufficient conditions for a local minimum have been indicated. Even though the sufficiency conditions are very complicated, the necessary conditions lend themselves to the formulation of iterative algorithms for the determination of the optimal transformation. In the proposed approach, the multiple integrals which appear at each step of the iteration are replaced by certain sample averages over training sets, a procedure which permits the carrying out of the required computations with a reasonable amount of effort even for values of $m$ not too low.

Testing of the proposed method on remotely sensed data provided by the Johnson Space Center Earth Observations Division is in progress at Rice University ICSA computer facilities, and the numerical results obtained will be discussed in a separate paper.